239 research outputs found

    Client-Driven Content Extraction Associated with Table

    The goal of the project is to extract content from tables in document images based on learnt patterns. Real-world users (i.e., clients) first provide a set of key fields within the table that they consider important. These fields are used to build a graph whose nodes are labelled with semantics and other features and whose edges are attributed with relations. The attributed relational graph (ARG) is then used to mine similar graphs from a document image. Each mined graph represents an item within the table, and hence a set of such graphs composes a table. We have validated the concept on a real-world industrial problem.

    Handwritten and Printed Text Separation in Real Document

    The aim of the paper is to separate handwritten and printed text in a real document containing noise and graphics, including annotations. Relying on the run-length smoothing algorithm (RLSA), the extracted pseudo-lines and pseudo-words are used as basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel performs a first labelling of each pseudo-word, including the study of its local neighbourhood. The context is then propagated between neighbours so that possible labelling errors can be corrected. To address running-time complexity, we propose linear-complexity methods based on k-NN with constraints. When a kd-tree is used, the running time is almost linearly proportional to the number of pseudo-words. The performance of our system is close to 90%, even with a very small learning dataset whose samples are mainly complex administrative documents. Comment: Machine Vision Applications (2013)
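
    The smoothing step named in this abstract can be illustrated with a minimal sketch of horizontal RLSA on a binary image, assuming 1 marks ink and 0 marks background. The function names and the `max_gap` threshold are hypothetical choices for illustration and are not taken from the paper, which does not give its parameters here.

```python
def rlsa_row(row, max_gap):
    """Horizontal run-length smoothing of one binary row (1 = ink, 0 = background).

    Background runs of at most max_gap pixels that lie between two ink
    pixels are filled with ink; longer runs and border runs are left alone.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= max_gap:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply horizontal RLSA to every row of a binary image (list of lists)."""
    return [rlsa_row(row, max_gap) for row in image]
```

    With a small threshold the characters of a word merge into one blob (a pseudo-word); a larger threshold merges words into pseudo-lines.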

    Morphological Tagging Approach in Document Analysis of Invoices

    In this paper, a morphological tagging approach for invoice document image analysis is described. Tokens that are close in morphology and confirmed by their location within different similar contexts reveal parts of speech representative of the structure elements. This bottom-up approach avoids the use of a priori knowledge, provided that there are redundant and frequent contexts in the text. The approach is applied to the invoice body text, roughly recognized by OCR and automatically segmented. The method makes it possible to detect the invoice articles and their different fields. The regularity of the article composition and its redundancy within the invoice help to recover its structure. The recognition rate on 276 invoices and 1704 articles is over 91.02% for articles and 92.56% for fields.

    Form Analysis by Neural Classification of Cells

    The original publication is available at www.springerlink.com. Our aim in this paper is to present a methodology for linearly combining multiple neural classifiers for cell analysis of forms. The features used for classification relate to the text orientation and to its character morphology. Eight classes are extracted, among them numeric, alphabetic, vertical, horizontal, capitals, etc. The classifiers are multi-layer perceptrons that first consider global features and then refine the classification at each step by looking for more precise features. The recognition rate of the classifiers for 3,500 cells from 19 forms is about 91%.

    A Short Tour OCR, ICR, DIA

    Invited international colloquium. This presentation reviews the state of the art in the domains of OCR, ICR and DIA.

    Arabic natural language processing: handwriting recognition

    The automatic recognition of Arabic writing is a very young research discipline with very challenging and significant problems. Indeed, in the era of the Internet and multimedia, the recognition of Arabic can contribute, like its neighbouring disciplines (Latin writing recognition, speech recognition and vision processing), to current applications around digital libraries, document security and numerical data processing in general. Arabic is a Semitic language spoken and understood in various forms by millions of people throughout the Middle East and in Africa, and it is used by 234 million people worldwide. Furthermore, Arabic gave rise to several other alphabets, such as Farsi and Urdu, which greatly increases the interest of this script. Farsi is the main language used in Iran and Afghanistan, and it is spoken by more than 110 million people, including some people in Tajikistan and Pakistan. Urdu is an Indo-Aryan language with about 104 million speakers. It is the national language of Pakistan and is closely related to Hindi, though a lot of Urdu vocabulary comes from Persian and Arabic, which is not the case for Hindi. Urdu has been written with a version of the Perso-Arabic script since the 12th century and is normally written in Nastaliq style.

    Digital Library and Document Server

    Invited international colloquium. This presentation describes the MORE project, developed in collaboration with JOUVE and the Royal Library ALBERT I. We propose a parser for library records that allows them to be restructured electronically.

    A case-based reasoning approach for unknown class invoice processing

    This paper introduces an invoice analysis approach using Case-Based Reasoning (CBR). CBR is used to analyze and interpret new invoices by drawing on previous processing experience. Each new document is segmented into structures and interpreted using a structure database. Interpreting a new document's structures relies on graph edit distance as well as on string edit distance. This paper focuses on document structure extraction as well as on document interpretation via the interpretation of its structures. The proposed system reaches an extraction and interpretation rate of 76.33%.
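
    As an illustration of the string edit distance mentioned in this abstract, here is a minimal sketch of the classic Levenshtein distance with unit costs. The abstract does not say which variant or cost scheme the authors use, so the unit costs and the function name `edit_distance` are assumptions; graph edit distance is not covered by this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b.

    Counts the minimum number of single-character insertions, deletions
    and substitutions (each with cost 1) needed to turn a into b.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

    A distance of this kind tolerates OCR noise when matching a recognized keyword against a stored one: for example, "T0TAL" is one substitution away from "TOTAL".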

    Use of PGM for Form recognition

    ISBN: 978-1-4673-0868-7. This paper addresses the use of a PGM (Probabilistic Graphical Model) for form model identification from just a few items filled in with an electronic pen. Only the electronic ink is sent to the system, without any indication of the form model. Two applications are considered in this study: the first concerns keynote form classification from its filled fields, while the second concerns a design modelling problem for the on-line configuration of shower areas. In the former, only indications of the filled fields are sent to the system, while in the latter, the designer sends strokes corresponding to the elements drawn on the form model. In this second application, a single form is offered to the user for entering the configuration of his shower area. The PGM is exploited to advantage in both cases, translating the relationships between corresponding elements precisely into conditional probabilities, from individual elements up to the constitution of the complete model.

    Named Entity Recognition by Neural Prediction

    Named entity recognition (NER) remains a very challenging problem, especially when the document on which it is performed is handwritten and ancient. Traditional methods using regular expressions, or those based on syntactic rules, work but are not generic because they require additional adaptation work for each dataset. We propose here a recognition method based on context exploitation and tag prediction. We use a pipeline model composed of two consecutive BLSTMs (Bidirectional Long Short-Term Memory). The first is a BLSTM-CTC coupling that recognizes the words in a text line using a sliding window and HOG features. The second BLSTM serves as a language model. It exploits the gates of the BLSTM memory cell by deploying some syntactic rules in order to store the content around the proper nouns. This operation allows it to predict the tag of the next word depending on its context, which is followed gradually until the named entity (NE) is discovered. All the words of the context are used to help the prediction. We have tested this system on a private dataset of the Philharmonie de Paris, for the extraction of proper nouns within music sales transactions, as well as on the public IAM dataset. The results are satisfactory compared to what exists in the literature.